DOCUMENT RESUME

ED 121 845                                              TM 005 277

AUTHOR       Rovinelli, Richard J.; Hambleton, Ronald K.
TITLE        On the Use of Content Specialists in the Assessment
             of Criterion-Referenced Test Item Validity.
PUB DATE     [Apr 76]
NOTE         37p.; Paper presented at the Annual Meeting of the
             American Educational Research Association (60th, San
             Francisco, California, April 19-23, 1976)
EDRS PRICE   MF-$0.83 HC-$2.06 plus Postage
DESCRIPTORS  *Content Analysis; *Criterion Referenced Tests; Data
             Collection; Evaluation Methods; *Item Analysis;
             Statistical Analysis; *Test Construction; *Test
             Validity



ABSTRACT 

Essential for an effective criterion-referenced
testing program is a set of test items that are "valid" indicators of
the objectives they have been designed to measure. Unfortunately, the
complex matter of assessing item validity has received only limited
attention from educational measurement specialists. One promising
approach to the item validity question is through the collection and
analysis of the judgments of content specialists. The purposes of this
paper are twofold: First, several possible rating forms and
statistical methods for the analysis of content specialists' data are
discussed. Second, the results of item validation work with science
teachers and three of the more promising rating forms are presented.
The overall results of the study clearly support the recommendation
for expanded use of content specialists' ratings in the item
validation process. (Author/RC)






On the Use of Content Specialists in the Assessment
of Criterion-Referenced Test Item Validity 



Richard J. Rovinelli
National Board of Medical Examiners 

and 

Ronald K. Hambleton 
University of Massachusetts, Amherst



Abstract 



Essential for an effective criterion-referenced testing
program is a set of test items that are "valid" indicators of
the objectives they have been designed to measure. Unfortunately,
the complex matter of assessing item validity has received only
limited attention from educational measurement specialists. One
promising approach to the item validity question is through the
collection and analysis of the judgements of content specialists.
The purposes of this paper are twofold: First, we will discuss
several possible rating forms and statistical methods for the
analysis of content specialists' data. Second, we will present
the results of our item validation work with science teachers and
three of the more promising rating forms. The overall results of
the study clearly support the recommendation for expanded use of
content specialists' ratings in the item validation process.




On the Use of Content Specialists in the Assessment 
of Criterion-Referenced Test Item Validity¹



Richard J. Rovinelli 
National Board of Medical Examiners 

and 

Ronald K. Hambleton 
University of Massachusetts, Amherst 



The amount of interest and energy that has been expended in the area
of criterion-referenced testing and measurement in the last few years has
been impressive. A wide variety of theoretical and practical problems
have received considerable attention from educational measurement
specialists (see, for example, Fremer, 1972; Hambleton & Novick, 1973;
Livingston, 1972; Millman, 1974; and Popham & Husek, 1969). Yet,
considering its importance, educational measurement specialists have
given relatively little attention to the problem of item validation,
i.e., the problem concerning the extent to which items are measures of
the objectives they have been designed to measure.

The problem of item validation is of particular importance with cri- 
terion-referenced tests because of the way the test score information is 
used. The success of an individualized program depends to a considerable 



1 A paper presented at the annual meeting of AERA, San Francisco,
  1976. The paper has been published as Laboratory of Psychometric
  and Evaluative Research Report No. 24. Amherst, Mass.:
  The University of Massachusetts, 1976.






extent upon how effectively teachers make decisions concerning student
mastery of specific instructional objectives. Unless one can say with
a high degree of confidence that the items in a criterion-referenced
test measure the instructional objectives, any use of the test
information for instructional decision-making is questionable.

To date, the two most popular approaches to the problem of assessing
item validity have been through the use of item generation rules (Hively
et al., 1973) and the empirical analysis of examinee test data. Relative
to the first of these approaches, while the use of item generation rules
is intuitively appealing and represents an excellent solution when the
rules can be applied, at the present time it would seem that the approach
is not practical in content areas besides mathematics. Relative to the
second approach, while the use of a variety of empirical methods on
examinee test data has been popular among criterion-referenced test
developers, at best this approach provides only partial data for the
determination of item validity (Millman, 1974; Rovinelli, 1976). A third
approach to the problem, which has received very little attention from
test developers, is the use of the judgements of content specialists.
However, before this approach can become a practical solution to the
problem of assessing item validity, there is a need for the generation,
organization and comparative analysis of possible data collection
techniques and methods of analysing content specialists' ratings.

Purposes of the Study 

In spite of the importance of the item validity problem to the
criterion-referenced testing area, to date there does not exist a
methodology for conducting item validation studies. What does exist is a
disorganized set of techniques that address different aspects of the item
validity problem. As recently as 1974, Popham posed two important
questions that still remain for criterion-referenced test developers:

1. What techniques can be devised which will permit objec- 
tive-based test developers to improve their instruments 
on the basis of empirical tryouts in the same ways that 
conventional test developers have been doing for years 
(e.g., total test reliability, item reliability, item
homogeneity, objective-item congruence)?

2. Are there technical rules which can be produced to aid
reviewers in judging the congruence between test items
and the objectives on which they are based?

Further, Skager (1974) adds the following important questions:

1. How does one establish the fact that items in the pool 
measuring any objective are valid in the sense of being
(a) congruent with the objective, e.g., actually measur- 
ing the performance described in the objective and (b) 
comprehensive in the sense of providing adequate cover- 
age of the domain specified by the objective? 

2. How does one identify poorly written items by means of 
item analysis procedures when the frequency of correct
response may be extremely high or low, accurately reflect- 
ing the achievement status of a particular group of 
learners? 

Given the importance of the item validity question and the shortage
of research on the use of content specialists' ratings, this study was
designed to achieve two purposes:

1. To generate and to organize appropriate judgemental data
techniques and methods of data analysis and reporting.

2. To examine three different techniques for the collection
of judgemental information with regard to the type,
reliability, and validity of the information provided.



An Organization of Item Validation Approaches 

We feel that it is useful to organize existing item validation
methods around three rather different approaches: item generation rules,
empirical methods, and the use of content specialists' ratings.

Through the use of item generation rules, one attempts to ensure
item validity by developing a direct relationship between an item and an
objective during the item construction phase (Anderson, 1972; Bormuth,
1970; Hively et al., 1968, 1973; Millman, 1974). As such, it is an
a priori approach as compared to the other a posteriori procedures which
are designed to assess whether or not a direct relationship between an
item and an objective exists through analyses of data conducted after
the item is written. However, the use of item generation rules as
currently formulated contains inherent problems which make their
implementation in many objectives-based programs impractical.

The second approach, the use of empirical procedures (for example, 
see Popham, 1971; Brennan and Stolurow, 1971) has been very popular but 
there remain many problems. For example, 

1. The procedures are dependent upon the characteristics of 
the group of examinees and the effects of instruction. 

2. They often require sophisticated statistical techniques 
and/or computer programs which are not available to the 
practitioner. 

3. When item statistics derived from empirical analyses of 
test data are used to "select" the items for a criterion-referenced
test, the test developer runs the risk of
obtaining a non-representative set of items from the 
domain of items measuring the objectives included in the 
test. 

4. Empirical methods in many instances require pre-test and
post-test data on the same test items and this is rarely 
done in classroom settings. 

In situations where a large sample of examinees is available and where
the test constructor is interested in identifying aberrant items, not
for elimination from the item pool but for correction, the use of an
empirical approach to item validation should provide important
information with regard to the assessment of item validity.

The third approach, the use of the judgements of content
specialists, appears to offer considerable promise as a means for
assessing item validity. The approach is not dependent on group
composition or instructional effects; may not require sophisticated
statistical techniques; is not restricted to highly structured content
domains; and can be implemented easily in practical settings.

A Methodology for the Use of Content Specialists' Ratings

The first step in the development of a methodology for the use of
the judgements of content specialists to assess item validity is to
delineate clearly the important issues. Five of the most important
issues are:

1. Can the content specialists make meaningful (valid)
judgements about the relevance of items to instructional
content?

2. Is there agreement amongst the ratings of content
specialists?

3. What information is one seeking to obtain from the
judgemental data?

4. What variables affect the judgemental techniques?

5. What techniques can be used for collecting content
specialists' ratings of test items?

Only the second question above has received serious attention. With 
respect to the other four issues, we have little information and few 
clear guidelines. 

The first question concerning the ability of content specialists
to make meaningful judgements was examined by Ryan (1968). He
requested four judgements for each test item. These judgements were:

A. How good or poor is the item for determining knowledge
and understanding of the instructional content
presented in each of your classes?

   Very poor    Poor    Fair    Good    Very good

B. What proportions of pupils in each class will answer
the item correctly?

   0    .20    .40    .60    .80    1.00

C. How much better will the most proficient third of the
pupils in each class do on the item compared to the
least proficient third?

   Same    Slightly better    Somewhat better
   Much better    Very much better

D. How appropriate or relevant is the item for the
instructional materials and content presented in each class?

   Not relevant    Somewhat relevant    Quite relevant    Very relevant

Ryan (1968) concluded that teachers can make judgements about test
items on two dimensions: (1) the relevance of the items to the
instructional content; and (2) the difficulty of the item. He based his
conclusions on results which showed a "relatively higher frequency with which
relevance as compared to judged difficulty was correlated with overall
quality and the relatively higher frequency with which judged difficulty,
as compared to relevance, was correlated with actual difficulty."

While Ryan's (1968) study is a step in the right direction, his
conclusions on the issue of relevance are weakly supported in that one
does not know whether the teachers perceived the judgement of quality
the same as a judgement of relevance. On the other hand, the judgement
of difficulty correlated highly with actual difficulty, which gives a
more conventional substantiation of judgemental validity.

The second question, concerning the consistency of agreement amongst
the content specialists, i.e., the reliability of the ratings, has been
examined by a number of researchers (Lu, 1971; Cohen, 1960; Light, 1971;
Fleiss, 1971; and Brennan and Light, 1973). It is not our intention to
review this extensive literature here. However, a description of one
prominent method for assessing agreement amongst content specialists
will follow.

Lu (1971) presented a method by which one can ascertain the
intensity of agreement amongst judges on an instrument requiring a
classification of items into a set of ordered categories. The observed
results of such a rating procedure are given as follows:



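                          Judges

Items      J_1     J_2    ...    J_j    ...    J_m

  1        X_11    X_12   ...    X_1j   ...    X_1m
  2        X_21    X_22   ...    X_2j   ...    X_2m
  .         .       .             .             .
  i        X_i1    X_i2   ...    X_ij   ...    X_im
  .         .       .             .             .
  n        X_n1    X_n2   ...    X_nj   ...    X_nm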



where X_ij represents the category assignment on the ith item by the jth
judge. We will assume that the rating scale consists of t ordered
categories.

Lu derived a set of weights for each category "based on a
transformation from the data's own distribution." These weights are
derived from the following array:



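                        Categories

Judges     1       2      ...    k      ...    t

 J_1       n_11    n_12   ...    n_1k   ...    n_1t
 J_2       n_21    n_22   ...    n_2k   ...    n_2t
  .         .       .             .             .
 J_j       n_j1    n_j2   ...    n_jk   ...    n_jt
  .         .       .             .             .
 J_m       n_m1    n_m2   ...    n_mk   ...    n_mt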






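where n_jk is the number of items placed in the kth category by the jth
judge. The scoring weight y_k for the kth category is defined as:

\[
y_k = \sum_{r=1}^{k-1} p_r + \tfrac{1}{2}\, p_k , \qquad k = 1, 2, \ldots, t
\]

where p_r = n_{.r}/mn, with n_{.r} the number of ratings placed in
category r summed over all m judges.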
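
The weighting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Lu's own procedure: it assumes the scoring weight for category k is the pooled proportion of ratings falling below k plus one-half of the proportion falling in k (the one-half term is a reading of a garbled symbol in the source), and the function name and example counts are hypothetical.

```python
def lu_category_weights(counts):
    """Return one scoring weight per ordered category.

    counts[j][k] is the number of items judge j placed in category k
    (the judges-by-categories array n_jk described in the text).
    """
    n_categories = len(counts[0])
    # Total number of ratings; equals m * n when every judge rates every item.
    total = sum(sum(row) for row in counts)
    # Pooled proportion of all ratings that fall in each category.
    proportions = [
        sum(row[k] for row in counts) / total for k in range(n_categories)
    ]
    weights = []
    below = 0.0  # running sum of the proportions for lower categories
    for p_k in proportions:
        # ASSUMPTION: the garbled term in the source is one-half of p_k.
        weights.append(below + 0.5 * p_k)
        below += p_k
    return weights


# Hypothetical data: 2 judges, 4 items each, 3 ordered categories.
example_counts = [[2, 1, 1], [1, 2, 1]]
example_weights = lu_category_weights(example_counts)
```

Because the weights come from the pooled distribution of ratings, they change whenever the data change, which is what "based on a transformation from the data's own distribution" suggests.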



An analysis of variance is then performed on the transformed
data. The coefficient of agreement A is calculated from two
quantities: the expected within-subject variance under the condition
that all ratings are equally likely, and the observed within-subject
variance.

     [Formula for A not legible in the source reproduction.]

A test of the significance of A is conducted indirectly by
testing "the hypothesis that the assignments of subjects to the
categories by the judges are equally likely for all categories."

     [Hypothesis statement and test statistic not legible in the
     source reproduction.]

The statistic is chi-square distributed with n(m-1) degrees of
freedom. If the hypothesis is rejected, one can conclude that A is
significantly different from zero (Lu, 1971).






The third question relates to the information which one seeks to
obtain from the judgements of content specialists with regard to
determining item validity. It would seem that such judgements should
provide two categories of information: (1) information which is considered
essential; and (2) information which is considered important. Under the
first category there are two types of information which must be collected.
These types are given as follows:

1. Information relating to whether or not an item is judged 
to be a measure of an objective.

2. Information relating to whether or not an item is judged 
to be a measure of more than one objective. 

The choice of the type of information which is to be collected under
the second category will vary from study to study as the choice is depend- 
ent on secondary goals or methodological considerations. Examples of 
secondary goals would be the determination of whether or not the content 
specialists can judge the difficulty of the items or whether the items 
were well written. An example of a methodological consideration would be 
the collection of data which would help validate the rating instrument. 

The fourth question, concerning the variables which affect the
judgements of content specialists, is particularly important. In comparing
methods for judging the similarity of personality inventory items, Girard
and Cliff (1973) found that "the criteria by which subjects were
instructed to judge similarities between items in a pair made a large
difference in the judgements." Four of the variables which are felt to
be important are:

1. Judgemental Procedures: Whenever possible, one should use
the simplest of techniques available to collect data. For
example, categorical judgements obtained from
sorting, rating and ranking procedures are usually less complex
than comparative judgements obtained from similarity,
dissimilarity or choice procedures.






2. Format of Presentation: The response task should not
be tedious and time consuming. For example, while there
are methods which can be used to reduce the number of
required responses (Torgerson, 1958), generally the
method of paired comparisons should be avoided if the
number of stimuli (items) is large, because of the
great number of responses involved.

3. Definition of Task: When describing the response task,
one should ensure that all the judges are operating under
the same assumptions. If one merely asks the judges to
rank or choose items according to personal preference,
the judges could obtain significant results based not
on real differences in the items but on the dimension
of preference. For example, the judges could have been
ranking the items on any one of the following levels
of the preference dimension:

A. Simplicity/Complexity of item, 

B. Closeness of match to hypothetical objective, 

C. Response mode required, 

D. Style in which the item was written. 

The directions relating to the response task must clearly 
define the criteria on which the choices are to be made. 

4. Settings for data collection: In choosing an instrument for
collecting the judgements of content specialists, the
setting in which the data are to be collected must be
taken into consideration. That is, the practicality of
its use in both research and non-research settings is a
key factor in the choice of instrument.
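
To make the concern about paired comparisons in item 2 concrete, the response burden can be worked out directly. The calculation below is an illustration only: it assumes a complete paired-comparison design in which each judge compares every pair of k items once, with no reduction method applied, and it uses the forty items of Study One as the example.

```latex
\[
\binom{k}{2} = \frac{k(k-1)}{2}, \qquad
\binom{40}{2} = \frac{40 \cdot 39}{2} = 780 .
\]
```

Under that assumption each judge would face 780 comparisons for a single objective, against 40 responses for a rating task on the same items.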

The fifth question outlined at the beginning of this section is
concerned with the choice of instrument which will be used to collect the
judgemental data. It is suggested that the test developer choose a
technique which conforms as closely as possible to the guidelines set forth
under the discussions of questions 1, 2 and 4 above, while at the same
time providing the information described in question 3.



Judgemental Techniques 

Three techniques for the collection and analysis of the judgements
of content specialists will be described in this section. These
techniques were chosen primarily to provide information on the efficacy of
the use of content specialists as a means for assessing item validity
and not to provide a definitive answer to the question of which
techniques are most appropriate.

(a) An Index of Item Homogeneity 

Hemphill and Westie (1950) developed an index of homogeneity of
placement for use in constructing personality tests. This index is a
numeric representation of the judgement of content specialists on the
extent to which they feel that an item belongs to one and only one
personality dimension. By substituting "objective" for "personality
dimension", the Index of Item Homogeneity can be used in item validation work.

According to Hemphill and Westie (1950):

This index was adopted to give a single numerical evaluation
of each item with respect to its homogeneity. Agreement
among judges that the item applied to a dimension and agreement
that it did not apply to other dimensions in the description
were given approximately equal weight in the value of this index.

The index of "homogeneity of placement" differs in two ways
from certain other techniques for examining item content. First,
it is based on "expert" judgement of probable response to the
items, not on actual item response data. Second, unlike indices
such as "internal consistency," "homogeneity" or "unidimensionality,"
all of which refer to relationship among items, the index
of "homogeneity of placement" involves both relationships among
items (as reflected by judge agreement that certain items apply
to the same dimension) and independence of relationship of the
item to other dimensions making up the same general heuristic
system.

The index appears to be a valid procedure for collecting and
analyzing judgemental data on item validity.

The mechanics of collecting data with the Hemphill-Westie
procedure consist of having the content specialists rate each item on each
of the objectives by assigning a value of +1, 0, or -1. The three
possible ratings have the following meaning:



+1 = definite feeling that an item is a measure of an objective
 0 = undecided about whether the item is a measure of an objective
-1 = definite feeling that an item is not a measure of an objective.

The formula presented by Hemphill and Westie (1950) to compute the
index of homogeneity of placement is given as follows:



     [Formula for I_ik not legible in the source reproduction.]

where

I_ik   is the Index of Homogeneity for item k on objective i,
N      is the number of objectives (i = 1, ..., N),
n      is the number of content specialists (j = 1, ..., n), and
X_ijk  is the rating (+1, 0, or -1) of item k as a measure of
       objective i by content specialist j.

While the Hemphill-Westie procedure is conceptually appropriate for
the task of collecting judgemental data from content specialists for the
purpose of assessing item validity, the computational formula given above
has some serious deficiencies. First, the maximum and minimum values
are .67 and 40, respectively. (The maximum value of this index occurs
when each content specialist assigns a +1 to the item for the
appropriate objective and a -1 for all the other objectives. The minimum
value occurs when content specialists assign a -1 to the item for the
appropriate objective and a +1 for all the other objectives.) For ease of
interpretation it is convenient if the range of the index is from -1 to +1.
Second, and an even more serious problem with the index, is that its value
varies as a function of the number of content specialists and objectives,
clearly an undesirable situation since it complicates the problem of
interpreting the index.

Given the above deficiencies, we have developed a new computational
formula for providing a numerical representation of Hemphill-Westie data.
This new formula will be called the Index of Item-Objective Congruence.
The assumptions under which this index was developed are:

1. That perfect item-objective congruence should be
represented by a value of +1 and will occur when all the
specialists assign a +1 to the item for the appropriate
objective and a -1 to the item for all the other
objectives.

2. That the worst judgement an item can receive should be
represented by a value of -1 and will occur when all the
specialists assign a -1 to the item for the appropriate
objective and a +1 to the item for all the other objectives.

3. That the assignment of a 0 to an item is poorer than a +1
but better than a -1. This is in effect saying that it
is better for a specialist to not be able to definitely
decide whether an item is a measure of an appropriate
objective than it is for the judge to feel that the item
is definitely not a measure of the objective.

4. That this index should be invariant to the number of content
specialists and the number of objectives.

The new computational formula is

\[
I'_{ik} = \frac{(N-1)\sum_{j=1}^{n} X_{ijk} \;-\; \sum_{i'=1,\; i' \neq i}^{N} \; \sum_{j=1}^{n} X_{i'jk}}{2(N-1)\,n}
\]

(All variables on the right-hand side of the expression have the
same meaning as in the Index of Homogeneity.)
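
A minimal Python sketch of the index may help in checking values by hand. It assumes the form implied by assumptions 1 to 4 above: (N-1) times the sum of ratings on the intended objective, minus the sum of ratings on all other objectives, divided by 2(N-1)n. The function name, argument layout and example ratings are hypothetical.

```python
def item_objective_congruence(ratings, target):
    """Index of item-objective congruence for one item.

    ratings[i][j] is the rating (+1, 0 or -1) given by content
    specialist j to this item as a measure of objective i.
    target is the row index of the objective the item was written for.
    """
    n_objectives = len(ratings)       # N
    n_specialists = len(ratings[0])   # n
    if n_objectives < 2:
        raise ValueError("the index needs at least two objectives")
    # Sum of ratings on the intended objective.
    target_sum = sum(ratings[target])
    # Sum of ratings on every other objective.
    other_sum = sum(
        sum(row) for i, row in enumerate(ratings) if i != target
    )
    numerator = (n_objectives - 1) * target_sum - other_sum
    return numerator / (2 * (n_objectives - 1) * n_specialists)


# Hypothetical ratings: 3 objectives, 4 specialists, objective 0 intended.
perfect = [[1, 1, 1, 1], [-1, -1, -1, -1], [-1, -1, -1, -1]]
worst = [[-1, -1, -1, -1], [1, 1, 1, 1], [1, 1, 1, 1]]
# Half the specialists give a perfect rating, half are undecided throughout.
half = [[1, 1, 0, 0], [-1, -1, 0, 0], [-1, -1, 0, 0]]

congruence_perfect = item_objective_congruence(perfect, 0)
congruence_worst = item_objective_congruence(worst, 0)
congruence_half = item_objective_congruence(half, 0)
```

Under these assumptions the three patterns give +1, -1 and .50, which matches the bounds required by assumptions 1 and 2 and the one-half case discussed in the text.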

The choice of a cutoff score for this index to separate "good" from
"bad" items can be based on some absolute standard relating to specific
proportions of perfect ratings for the items. For example, if one-half
of the content specialists judged an item to be a perfect match to an
objective, while the others were not able to make a decision, the computed
value of the index would be .50. Thus, test constructors obtaining
I' values of .50 would know that, at a minimum, 50 percent
of the content specialists gave a perfect rating to the item.

As with the Hemphill-Westie Index, there is no means for
determining the statistical significance of the values for the Index of
Item-Objective Congruence. However, the use of Lu's coefficient of agreement
amongst the judges will give an indication of how reliable (or consistent)
the judgements are. This indication of consistency of judgements along
with the known values that the index would take with specific proportions
of perfect ratings will give the test constructor a very good idea as
to how meaningful a particular I' value is for an item.

(b) Semantic Differential Technique 

The second procedure employs the semantic differential
technique (Osgood, Suci and Tannenbaum, 1957). The content specialists
are presented with an objective and all the items on which ratings are
desired. They are asked to make a judgement which consists of deciding
whether the item-objective relationship is best described by the
adjective toward the left end or toward the right end of the scale.

The following is an example consisting of one objective, one item
and two adjective scales along with a set of typical directions:

Objective: Given the chemical formula for a molecule, determine
the number of atoms in a molecule.

Item 1: How many atoms are there in a molecule of sulfuric
acid, H2SO4?

Directions 

Given the objective and item above, your task is to make
judgements on the relationship between the objective and the item on the
adjective scales indicated below.

Scale 1:   very                     no                       very
           relevant    relevant    feeling    irrelevant    irrelevant

Scale 2:   very                     no                       very
           important   important   feeling    unimportant   unimportant



The data obtained from the use of this technique can be analyzed 
without employing any elaborate statistical procedures. Therefore, 
it can easily be used in practical settings such as in the classroom 
by teachers. The information which is needed is the scale mean score
for each objective. However, the data also lends itself to more elabor- 
ate statistical analysis if required. An examination of the standard 
deviations of the scores given each item on each of the scales will pro- 
vide an indication of the extent of agreement among the content special- 
ists. 
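
The summary just described can be sketched with Python's standard library. The sketch rests on assumptions that the text does not state: the five scale positions are coded 5 (e.g. "very relevant") down to 1 ("very irrelevant"), and the population form of the standard deviation is used. The ratings shown are hypothetical.

```python
from statistics import mean, pstdev


def summarize_scale(scores):
    """Return (mean, standard deviation) of one item's ratings on one scale.

    scores holds one numeric rating per content specialist.
    ASSUMPTION: positions are coded 5 (most favourable) to 1 (least).
    """
    return mean(scores), pstdev(scores)


# Hypothetical ratings of one item on the relevance scale by five specialists.
relevance_scores = [5, 5, 4, 5, 3]
relevance_mean, relevance_sd = summarize_scale(relevance_scores)
```

A high mean with a small standard deviation would indicate that the specialists agree the item is relevant to the objective; a large standard deviation signals disagreement even when the mean looks acceptable.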

(c) A Matching Procedure

A third procedure used to obtain the judgements of content
specialists involves the use of a matching task. The content specialists are
presented with two lists. The first list contains a set of items. The
second list is a set of objectives. The content specialists match items 
to objectives that they feel they measure. A contingency table can then 
be constructed to represent the number of times each item is assigned to 
each objective across the content specialists. A visual examination of 
a contingency table will provide information concerning the deviant items. 
Statistical tests can also be done. 
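
The tabulation step can be sketched as follows. The data layout (one dictionary per content specialist, mapping each item to the objective that specialist chose) and all names are hypothetical, chosen only for illustration.

```python
from collections import Counter


def contingency_table(assignments, items, objectives):
    """Count how often each item was matched to each objective.

    assignments is a list with one dict per content specialist,
    mapping an item label to the objective label chosen for it.
    Returns a dict: item -> list of counts, ordered as in objectives.
    """
    counts = Counter()
    for specialist in assignments:
        for item, objective in specialist.items():
            counts[(item, objective)] += 1
    return {
        item: [counts[(item, objective)] for objective in objectives]
        for item in items
    }


# Hypothetical matching data from three specialists.
example_assignments = [
    {"item 1": "A", "item 2": "B"},
    {"item 1": "A", "item 2": "A"},
    {"item 1": "A", "item 2": "B"},
]
example_table = contingency_table(
    example_assignments, ["item 1", "item 2"], ["A", "B"]
)
```

In this made-up example every specialist matches item 1 to objective A, while item 2 splits between the two objectives, so item 2 is the one a visual inspection would single out.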

An Empirical Study of Several Judgemental Methods 

In this section, two studies used to collect the judgements of
content specialists on items from two different tests designed to
measure performance on an individualized science learning package will be
described. In Study One, twenty-one science teachers were administered
an item validation questionnaire which was designed to determine the
extent to which they thought a set of items were measures of the
intended objectives. The teachers (or content specialists as we will refer to
them) were asked to make judgements on forty items and eleven objectives
using the Hemphill-Westie categorizing technique. Table 1 contains the
expected match between the items and objectives.

In Study Two, a more complex research design and item validation 
questionnaire were used to obtain the judgements of content specialists 
on a set of forty-eight science items and twelve science objectives. 
The twelve instructional objectives and their matched items (see Table 2) 
were divided into three subgroups. Each of these subgroups (denoted sub- 
group one, two, and three) consisted first of four objectives and their 
four corresponding items for a total of 16 test items. Next, two addi- 
tional objectives from the initial pool of twelve objectives, without 
their corresponding items, were assigned to each subgroup resulting in 
a final subgroup composition of six objectives and sixteen items. Finally, 
three different forms of an item validation questionnaire were constructed 
by assigning each of the three subgroups of items and objectives to one 
of three judgemental procedures: the Hemphill-Westie categorizing
technique, the semantic differential rating technique and the matching
technique. All three judgemental procedures were described in previous
sections. The form of each questionnaire is as follows:

                            Judgemental Procedure

Questionnaire Form    Categorizing      Rating            Matching

        1             Subgroup One      Subgroup Two      Subgroup Three
        2             Subgroup Two      Subgroup Three    Subgroup One
        3             Subgroup Three    Subgroup One      Subgroup Two
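
The layout of the three forms is a cyclic rotation of the subgroups across the procedures, so that each subgroup meets each procedure exactly once. A short sketch of that rotation follows; the function name and data structures are hypothetical.

```python
def questionnaire_forms(subgroups, procedures):
    """Rotate subgroups across procedures, one rotation per form.

    Form f (1-based) pairs the p-th procedure with the subgroup at
    position (f - 1 + p) modulo the number of subgroups.
    """
    k = len(procedures)
    if len(subgroups) != k:
        raise ValueError("need as many subgroups as procedures")
    forms = {}
    for f in range(k):
        forms[f + 1] = {
            procedures[p]: subgroups[(f + p) % k] for p in range(k)
        }
    return forms


forms = questionnaire_forms(
    ["Subgroup One", "Subgroup Two", "Subgroup Three"],
    ["Categorizing", "Rating", "Matching"],
)
```

With three subgroups and three procedures this reproduces the assignment shown in the table.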






TABLE 1

EXPECTED MATCH BETWEEN THE TEST ITEMS
AND THE OBJECTIVES THEY WERE
DESIGNED TO MEASURE

Objective    Test Items

    1        1, 2
    2        3, 4, 7, 9
    3        5, 6, 8, 10
    4        11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21
    5        22, 23
    6        24, 25
    7        26, 27, 28
    8        29, 30, 31
    9        32, 33, 34
   10        35, 36, 37
   11        38, 39, 40




TABLE 2

EXPECTED MATCH BETWEEN THE TEST ITEMS
AND THE OBJECTIVES THEY ARE
DESIGNED TO MEASURE

Objective    Test Items

    1        1, 13, 25, 37
    2        2, 14, 26, 38
    3        3, 15, 27, 39
    4        4, 16, 28, 40
    5        5, 17, 29, 41
    6        6, 18, 30, 42
    7        7, 19, 31, 43
    8        8, 20, 32, 44
    9        9, 21, 33, 45
   10        10, 22, 34, 46
   11        11, 23, 35, 47
   12        12, 24, 36, 48






Ten science teachers (not the same teachers as in Study One)
were randomly assigned to complete each form of the questionnaire.
Thus, for any one subgroup of objectives and items, there was
information available from three different groups of content specialists
using three different judgemental procedures.

The data collected from both studies were examined, where
appropriate, with regard to the following questions:

1. Do the judgemental data provide information which can
be used to assess the extent to which an item is a measure
of an instructional objective?

2. Is the information obtained reliable in the sense that
there is consistency of agreement amongst the content
specialists?

3. Are the data valid?

The Hemphill-Westie Categorizing Procedure

For both Studies One and Two, a decision was made to set the
cutoff score for the index of item-objective congruence, the numerical
representation of the Hemphill-Westie data, at .70. That is, items
having item-objective congruence indices less than .70 were identified
as not being valid measures of their intended objectives. The
results of the calculation of these indices are presented in Tables 3
and 4. In Study One, items 3, 4, 7, 8, 9, 10, 15, 18, 19, 20, 26, 31
and 34 were identified as not being valid measures of the intended
objectives. In Study Two, items 8, 10, 13, 14, 16, 22, 23, 24, 35, 40
and 41 were identified as not being valid measures of the intended
objectives.
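The cutoff rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `flag_invalid` and the dictionary layout are assumptions, and the index values are the eight entries for Objectives 1 and 2 as they can be read from Table 4.

```python
# Minimal sketch of the .70 cutoff rule; names and data layout are hypothetical.
CUTOFF = 0.70

# Item number -> index of item-objective congruence on the intended objective
# (values as read from Table 4, Objectives 1 and 2 only).
indices = {
    1: 0.81, 13: 0.62, 25: 0.83, 37: 0.72,
    2: 0.82, 14: 0.50, 26: 0.82, 38: 0.84,
}

def flag_invalid(indices, cutoff=CUTOFF):
    """Return the items whose index falls strictly below the cutoff."""
    return sorted(item for item, value in indices.items() if value < cutoff)

flagged = flag_invalid(indices)
print(flagged)  # [13, 14]
```

Items 13 and 14 are both in the list of Study Two items reported above as not being valid measures.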

The Hemphill-Westie procedure requires that the content specialists
judge each item against all of the objectives. If an item is judged


TABLE 3

VALUES FOR THE INDEX OF ITEM-OBJECTIVE
CONGRUENCE ON TEST ITEMS IN DATA SET ONE

Objective    Test Item (Index)

    1        1 (.80), 2 (.70)
    2        3 (.57), 7 (.56), 9 (.50)
    3        4 (.61), 5 (.77), 6 (.77), 8 (.50), 10 (.63)
    4        11 (.93), 12 (.93), 13 (.93), 14 (.91), 15 (.50), 16 (.93),
             17 (.95), 18 (.54), 19 (.35), 20 (.21), 21 (.85)
    5        22 (.82), 23 (.80)
    6        24 (.92), 25 (.92)
    7        26 (.62), 27 (.89), 28 (.73)
    8        29 (.81), 30 (.85), 31 (.38)
    9        32 (.82), 33 (.72), 34 (.59)
   10        35 (.94), 36 (.94), 37 (.82)
   11        38 (.87), 39 (.82), 40 (.82)



TABLE 4

VALUES FOR THE INDEX OF ITEM-OBJECTIVE CONGRUENCE AND THE
SD STATISTIC FOR DATA SET TWO

(Index/SD Statistic)

Subgroup    Objective    Test Item    Index/SD Statistic

   B            1             1           .81/.69
                             13           .62/.46
                             25           .83/.79
                             37           .72/.78

   A            2             2           .82/.57
                             14           .50/.47
                             26           .82/.74
                             38           .84/.81

   B            3             3           .90/.50
                             15           .98/.50
                             27           .92/.82
                             39           .86/.55

   C            4             4           .83/.61
                             16           .40/.32
                             28           .37/.40
                             40           .60/.30

   C            5             5           .96/.78
                             17           .85/.59
                             29           .95/.57
                             41           .63/.41

[Entries for the items matched to Objectives 6 through 12 are illegible
in the source copy.]






to be a measure of more than one objective, its item-objective
congruence index will be lowered. For both studies, the item-objective
congruence indices were always considerably higher when the items were
matched to the intended objectives than when they were matched to the
other objectives. It appeared that the content specialists could make
meaningful judgements in the assessment of item validity.

The next analyses were concerned with determining whether levels
of item-objective congruence indices were based on reliable data. That
is, were the content specialists consistent in their judgements of the
test items? The assessment of the consistency of agreement amongst
judges was made by calculating Lu's (1971) coefficient of agreement. A
coefficient of agreement was obtained for each objective subgroup for
both Data Sets One and Two. The results are presented in Tables 5 and 6.
For all twenty-three objectives, the coefficients of agreement were
significant. These findings support the hypothesis that the
Hemphill-Westie judgemental data were reliable in the sense that there
was substantial consistency of agreement amongst the judges.

TABLE 5

LU'S COEFFICIENT OF AGREEMENT FOR THE
OBJECTIVE SUBGROUPS OF DATA SET ONE

Objective    Lu's Coefficient    df     χ² Statistic

    1             .83            819       .16*
    2             .86            819       .13*
    3             .90            819       .08*
    4             .91            819       .07*
    5             .88            819       .10*
    6             .90            819       .07*
    7             .91            819       .08*
    8             .94            819       .02*
    9             .89            819       .09*
   10             .88            819       .11*
   11             .91            819       .08*

*p<.01

TABLE 6

LU'S COEFFICIENT OF AGREEMENT FOR THE
OBJECTIVE SUBGROUPS OF DATA SET TWO

Objective    Lu's Coefficient    df     χ² Statistic

    1             .80            112       .20*
    2             .8[?]          128       .16*
    3             .67            112       .33*
    4             .57            128       .41*
    5             .86            128       .14*
    6             .75            128       .25*
    7             .88            128       .11*
    8             .74            128       .26*
    9             .83            128       .16*
   10             .88            112       .13*
   11             .83            128       .16*
   12             .83            112       .16*

*p<.01

For the purposes of this study, validity of the judgemental data
was defined as the degree of agreement between different groups of
content specialists assessing item validity through the use of different
judgemental procedures. For Study One, there were insufficient data to
check for validity. For Study Two, the degree of agreement was obtained
by correlating two rank orderings of the items based on the sizes of
judgemental statistics calculated from the categorizing and rating
procedures. The first rank ordering of the items was established by using
values of the index of item-objective congruence. The second rank
ordering was established by using values of a statistic (SD) calculated
from the semantic differential ratings on the items. This statistic was
computed using the following algorithm:

a. Compute the sum (y1) of the ratings for each item, on the
objective to which it was matched, across content specialists.

b. Compute the sum (y2) of the ratings for each item on the
remaining objectives across content specialists.

c. Compute the rank order statistic (SD) from the ratio of
sum one (y1) to sum two (y2). For a rating scale having
values from one to k, this statistic (SD) has a maximum
value given as

\[ \max(SD) = \frac{nk}{n(N-1)\,(k-(k-1))} = \frac{nk}{n(N-1)} = \frac{k}{N-1} \]

The minimum value for SD is given as

\[ \min(SD) = \frac{n}{n(N-1)\,k} = \frac{1}{(N-1)\,k} \]

where n is the number of content specialists,
N is the number of objectives, and
k is the highest value of the rating scale.

For Study Two, with six objectives per judgemental subgroup,
the maximum value for the SD statistic is 1 and the minimum value is
.04.
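Steps a through c and the two bounds can be sketched as follows. The function names, the list-of-lists layout, and the example ratings are assumptions made for illustration; only the arithmetic comes from the algorithm above, with the one-to-five scale used in Study Two.

```python
# Minimal sketch of the SD statistic; names and example data are hypothetical.

def sd_statistic(ratings, matched):
    """ratings: one list per content specialist, giving that specialist's
    rating of a single item on each of the N objectives.
    matched: position of the objective the item was written to measure."""
    y1 = sum(r[matched] for r in ratings)            # step a: matched objective
    y2 = sum(sum(r) - r[matched] for r in ratings)   # step b: remaining objectives
    return y1 / y2                                   # step c: ratio of the sums

def sd_bounds(k, n_objectives):
    """Maximum and minimum SD for a 1..k scale and N objectives."""
    max_sd = k / (n_objectives - 1)
    min_sd = 1 / ((n_objectives - 1) * k)
    return max_sd, min_sd

# Three hypothetical specialists rating one item on six objectives (1-5 scale).
ratings = [
    [5, 1, 2, 1, 1, 1],
    [4, 1, 1, 1, 2, 1],
    [5, 2, 1, 1, 1, 1],
]
sd = sd_statistic(ratings, matched=0)
bounds = sd_bounds(k=5, n_objectives=6)
print(round(sd, 2), bounds)
```

With k = 5 and N = 6 the bounds come out as 1 and .04, matching the values given for Study Two.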

For each of the three subgroups of objectives, consisting of 16
items each, Spearman's rank-difference correlation coefficient was
calculated between the item-objective congruence indices and the item
SD statistics. The three Spearman coefficients reported in Table 7 were
statistically significant and above .65, suggesting substantial
agreement as to the quality of test items across the two methods for
judging items.
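A rank correlation of this kind can be sketched in plain Python as below. Two points are assumptions rather than statements from the paper: ties are handled by average ranks, and the "Statistic" column of Table 7 is taken to be the usual t approximation t = r·sqrt((n-2)/(1-r²)), which reproduces the tabled values closely for n = 16. The example data are the eight Subgroup B items legible in Table 4, so the result is illustrative and is not expected to match the reported coefficient for all 16 items.

```python
# Minimal sketch of Spearman's rank correlation; helper names are hypothetical.
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of the 1-based positions i..j
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

def t_statistic(r, n):
    """Assumed significance test: t with n - 2 degrees of freedom."""
    return r * ((n - 2) / (1 - r ** 2)) ** 0.5

# Items 1, 13, 25, 37, 3, 15, 27, 39 (index, then SD) as read from Table 4.
index = [0.81, 0.62, 0.83, 0.72, 0.90, 0.98, 0.92, 0.86]
sd = [0.69, 0.46, 0.79, 0.78, 0.50, 0.50, 0.82, 0.55]
rho = spearman(index, sd)
print(round(rho, 2), round(t_statistic(0.66, 16), 2))
```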




TABLE 7

RANK ORDER CORRELATIONS OF ITEM-OBJECTIVE
CONGRUENCE INDICES AND THE SD STATISTIC
FOR DATA SET TWO

Objective                                       Rank Difference
Group       Test Items                          Correlation       Statistic

   A        2, 7, 8, 9, 14, 19, 20,                 .82             5.31*
            21, 26, 31, 32, 33, 38,
            43, 44, 45

   B        1, 3, 10, 12, 13, 15,                   .66             3.30*
            22, 24, 25, 27, 34, 36,
            37, 39, 46, 48

   C        4, 5, 6, 11, 16, 17, 18,                .67             3.38*
            23, 28, 29, 30, 35, 40,
            41, 42, 47

*p<.01




The Semantic Differential Rating Procedure 

For Study Two, the second judgemental procedure required that
the content specialists assign a semantic-differential-like rating of
from one to five to an item depending on whether the item was judged
as an irrelevant or relevant measure of the objective in question.
The fact that the content specialists consistently rated items higher
on the intended objectives than on the other objectives was taken as
an indication that these data did provide meaningful information for
assessing item validity. However, one problem associated with the use
of these ratings is that they do not provide information on whether
or not the items were judged to be a measure of more than one objective.
Therefore, the SD statistic discussed previously was computed for the
items, as it takes into consideration the ratings assigned to the item
for the other objectives. It was arbitrarily decided that items having
SD values less than .50 would be identified as not being valid measures
of the objectives to which they were matched. For Data Set Two, items
2, 8, 10, 13, 14, 16, 22, 23, 35, and 40 were identified as invalid.

An assessment of the reliability of these ratings was made through 
an examination of the standard deviations of the ratings on an item and 
the objective to which it was matched. With the exception of a few items 
these standard deviations were quite small, indicating that the content 
specialists were making the same ratings with respect to the item. These 
results are presented in Table 8. 
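The mean and standard deviation reported for each item can be computed as in the sketch below. The ratings are hypothetical (ten specialists, seven rating the item 5 and three rating it 4), and since the paper does not say whether the population or the sample form of the standard deviation was used, both are shown.

```python
# Minimal sketch with hypothetical ratings from ten content specialists.
from statistics import mean, pstdev, stdev

ratings = [5, 5, 5, 5, 5, 5, 5, 4, 4, 4]

item_mean = mean(ratings)
population_sd = pstdev(ratings)   # divides by n
sample_sd = stdev(ratings)        # divides by n - 1

print(round(item_mean, 1), round(population_sd, 2), round(sample_sd, 2))
# 4.7 0.46 0.48
```

A standard deviation near zero means nearly every specialist gave the item the same rating on its matched objective, which is the sense of reliability used in this section.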

TABLE 8

SEMANTIC DIFFERENTIAL RATINGS ON THE
TEST ITEMS FROM DATA SET TWO

Test    SD      Rating   Standard        Test    SD      Rating   Standard
Item    Coeff.  Mean     Deviation       Item    Coeff.  Mean     Deviation

  1     .69     4.8        .46            25     .79     4.3        .95
  2     .57     4.7        .45            26     .74     4.6        .48
  3     .50     4.2        .70            27     .82     4.7        .48
  4     .61     4.6        .52            28     .40     5.0        .00
  5     .78     5.0        .00            29     .57     4.4        .95
  6     .54     4.9        .31            30     .49     4.8        .32
  7     .63     5.0        .00            31     .70     5.0        .00
  8     .49     4.2       1.21            32     .59     4.7        .45
  9     .80     5.0        .00            33     .74     4.6        .48
 10     .43     4.0        .78            34     .69     4.5        .53
 11     .41     5.0        .00            35     .35     4.7        .48
 12     .56     4.7        .48            36     .66     5.0        .00
 13     .46     4.1        .80            37     .78     4.6        .52
 14     .47     4.2        .75            38     .81     5.0        .00
 15     .50     4.2        .55            39     .55     4.7        .48
 16     .32     4.9        .32            40     .30     4.7        .48
 17     .59     4.2        .70            41     .41     3.9        .88
 18     .51     4.7        .48            42     .36     3.5       1.27
 19     .61     4.7        .45            43     .68     4.7        .45
 20     .61     4.5        .50            44     .77     4.8        .40
 21     .76     4.8        .40            45     .68     4.7        .45
 22     .42     5.0        .60            46     .54     4.8        .40
 23     .42     4.8        .40            47     .51     4.0        .82
 24     .54     5.0        .00            48     .59     4.9        .31

The Matching Procedure

For the matching technique the content specialists were asked to
match each item to the objective they felt it measured. The data
collected from the use of this technique are different from the data
collected from the use of the other two techniques in that the content
specialists were not required to judge each item on all the objectives.

TABLE 9

CONTINGENCY TABLES FOR DATA COLLECTED FROM THE CONTENT SPECIALISTS
IN MATCHING THE TEST ITEMS TO THE OBJECTIVES IN DATA SET TWO

[Three item-by-objective contingency tables, one each for Objective
Subgroups A, B, and C. The cell frequencies are illegible in the source
copy.]

An (m × N) contingency table of items (m) and objectives (N) was
constructed. The mN cell frequencies consisted of the number of times
a content specialist matched an item to a particular objective.
Discrepancies between the expected matches and the actual matches were
used to identify invalid items. A minimum criterion was established:
seventy percent of the content specialists must have correctly matched
an item to an objective before the item could be declared valid. Using
this criterion, the results presented in Table 9 show that for Data Set
Two, items 8, 25, 28, 35, 41 and 47 were identified as not having item
validity. The relatively high number of correct matches is an indication
that this information can be used to assess item validity.
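The contingency table and the seventy percent criterion can be sketched as follows. The function names and the match data are hypothetical; they are not taken from Table 9.

```python
# Minimal sketch of the matching analysis; names and data are hypothetical.
from collections import Counter

def contingency(matches):
    """Build item -> Counter(objective -> number of specialists choosing it)
    from a list of (item, chosen_objective) pairs."""
    table = {}
    for item, objective in matches:
        table.setdefault(item, Counter())[objective] += 1
    return table

def invalid_items(table, expected, criterion=0.70):
    """Flag items matched to their expected objective by fewer than
    `criterion` of the specialists who judged them."""
    flagged = []
    for item, counts in sorted(table.items()):
        total = sum(counts.values())
        correct = counts[expected[item]]   # Counter gives 0 if never chosen
        if correct / total < criterion:
            flagged.append(item)
    return flagged

# Ten hypothetical specialists matching three items.
matches = (
    [(1, 1)] * 8 + [(1, 2)] * 2      # item 1: 8 of 10 correct
    + [(2, 2)] * 6 + [(2, 3)] * 4    # item 2: 6 of 10 correct
    + [(3, 3)] * 7 + [(3, 1)] * 3    # item 3: 7 of 10 correct
)
expected = {1: 1, 2: 2, 3: 3}

table = contingency(matches)
flagged = invalid_items(table, expected)
print(flagged)  # [2]
```

Item 3 meets the criterion exactly (seventy percent) and is therefore not flagged.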

One means for assessing the reliability of the data collected 
through the use of a matching technique is to calculate the amount of 
agreement between the expected matches and the actual standard. Light 
(1971) has developed a statistic (G) which provides a numerical repre- 
sentation of this amount of agreement which can be tested statistically 
for significance. However, because of the relatively small number of 
judgements required of the content specialists, it was not calculated 
for this data. 

The data collected using the matching technique did not lend itself 
to the assessment of validity as defined in this study. Therefore, no 
determination of the validity of this data was made. 

Summary and Conclusions 

In this study, three techniques for collecting and analyzing the 






judgements of content specialists as a means for assessing item validity
were discussed. All three techniques were shown to provide information
which could be used to ascertain if an item was a measure of an objective.
However, there were differences in the types of data which were collected
through the use of these techniques. For example, there were many more
low SD statistics than low item-objective congruence indices for the same
items. This is an indication that the content specialists, when using
the semantic differential rating procedure, judged the items to be
relevant measures of objectives other than the intended ones more often
than when using the categorizing procedure. It appears that these two
procedures are tapping different dimensions.

Given the task of judging which items are measures of intended
objectives, the Hemphill-Westie procedure is recommended over the other
two techniques. Two statements are offered in support of this
recommendation. One, the numeric representation of the data, the index
of item-objective congruence, provides a meaningful interpretation of
the extent to which an item is judged to be a valid measure of the
intended objective. Two, there are means for determining the reliability
and validity of the data collected. Further, these methods can be tested
for significance.

On the other hand, there are drawbacks to the use of the
Hemphill-Westie procedure which could be rectified through the use of
other judgemental techniques. These drawbacks are given as follows:

1. The procedure cannot be used to collect information on
such topics as quality of the item, and type of
distractors.

2. The dimensionality of the data must be known in advance
of its use.






3. The procedure is quite time consuming, particularly if
the numbers of items and of objectives are large.

Thus, before selecting the type of judgemental procedure to use,
the test constructor should take into consideration the information
desired and the resources available and then choose the most appropriate
procedure.

Basic to an effective criterion-referenced testing program is a set
of test items that are "valid" indicators of the objectives they were
designed to measure. Unfortunately, the matter of assessing item
validity has received only limited discussion in the voluminous
criterion-referenced testing literature. It is clear from this study that
one promising approach to the item validity question is through the
collection and analysis of the judgements of content specialists.

Our expectation is that the results reported in the study will
provide some direction for the continued development of methodologies
for the collection and analysis of content specialists' judgements.






REFERENCES 



Anderson, R.C. How to construct achievement tests to assess
comprehension. Review of Educational Research, 1972, 42, 145-170.

Bormuth, J.R. On the theory of achievement test items. Chicago:
University of Chicago Press, 1970.

Brennan, R.L., and Light, R.J. Measuring agreement when categories
are not predetermined. Boston: Laboratory of Human
Development, Harvard University, 1973.

Brennan, R.L., and Stolurow, L.M. An empirical decision process
for formative evaluation. Research Memorandum No. 4.
Harvard CAI Laboratory, Cambridge, Mass., 1971.

Cohen, J. A coefficient of agreement for nominal scales. Educational
and Psychological Measurement, 1960, 20, 37-46.

Fleiss, J.L. Measuring nominal agreement among many raters.
Psychological Bulletin, 1971, 76, 378-382.

Fremer, J. Criterion-referenced interpretations of achievement
tests. Test Development Memorandum TDM-71-1. Princeton,
N.J.: Educational Testing Service, 1972.

Girard, R., and Cliff, N. A comparison of methods for judging the
similarity of personality inventory items. Multivariate
Behavioral Research, 1973, 8, 71-88.

Hambleton, R.K., and Novick, M.R. Toward an integration of theory
and method for criterion-referenced tests. Journal of
Educational Measurement, 1973, 10, 159-170.

Hemphill, J., and Westie, C.M. The measurement of group dimensions.
Journal of Psychology, 1950, 29, 325-342.

Hively, W., Maxwell, G., Rabehl, G., Sension, D., and Lundin, S.
Domain-referenced curriculum evaluation: A technical
handbook and a case study from the Minnemast
Project. Monograph Series in Evaluation, No. 1.
Los Angeles: Center for the Study of Evaluation,
University of California, 1972.

Hively, W., Patterson, H.L., and Page, S. A "universe-defined"
system of arithmetic achievement tests. Journal of
Educational Measurement, 1968, 5, 275-290.

Light, R.J. Issues in the analysis of qualitative data. In
R. Travers (Ed.), Second handbook of research on teaching.
Chicago: Rand McNally, 1971, 318-381.

Livingston, S.A. Criterion-referenced applications of classical test
theory. Journal of Educational Measurement, 1972, 9, 13-26.

Lu, K.H. A measure of agreement among subjective judgements.
Educational and Psychological Measurement, 1971, 31, 75-84.

Millman, J. Criterion-referenced measurement. In W.J. Popham (Ed.),
Evaluation in education: Current practices. San
Francisco: McCutchen Publishers, 1974.

Osgood, C.E., Suci, G.J., and Tannenbaum, P.H. The measurement
of meaning. Urbana: University of Illinois Press, 1957.

Popham, W.J. Indices of adequacy for criterion-referenced test
items. In W.J. Popham (Ed.), Criterion-referenced
measurement. Englewood Cliffs, N.J.: Educational
Technology Publications, 1971.

Popham, W.J., and Husek, T.R. Implications of criterion-referenced
measurement. Journal of Educational Measurement, 1969,
6, 1-9.

Rovinelli, R.J. Methods for validating criterion-referenced test
items. Unpublished doctoral dissertation, University of
Massachusetts, Amherst, 1976.

Ryan, J.J. Teacher judgements of test item properties. Journal
of Educational Measurement, 1968, 5, 301-306.

Skager, R.W. Generating criterion-referenced tests from objective-
based assessment systems: Unsolved problems in test
development, assembly, and interpretation. CSE Monograph
Series in Evaluation. Los Angeles: Center for
the Study of Evaluation, UCLA, 1974.






